\section{Preliminaries}
\label{sec:def}

\subsection{Problem Definition}
\label{sec:probdef}

% http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture14.pdf

In this paper, we denote the length of the stream by $n$; since the length of a stream can be maintained using only $\lceil \log{n}\rceil$ bits, we assume that this value is available after processing the stream. The stream consists of real-valued numbers that can be stored at unit cost (with some fixed precision); the only computation we perform on these values is the standard comparison operations. We denote the items by the values $\{X_1, X_2, X_3, \ldots, X_n\}$. In order to simplify notation later, we assume that $X_1 \leq X_2 \leq X_3 \leq \ldots \leq X_n$ and that the stream is presented in some order $[X_{\pi(1)}, \ldots, X_{\pi(n)}]$, where $\pi$ is an arbitrary permutation of the values $\{1, \ldots, n\}$; i.e., the stream need not arrive in sorted order.

Now, in order to define the problem addressed in this paper, we remind the reader about the definition of several terms from probability theory. The {\it distribution function} (sometimes referred to as the {\it cumulative distribution function} or c.d.f.) of a distribution is defined as the function 
$$F(x) = \mathrm{Pr}(X \leq x),$$
where $X$ is a random variable drawn from the distribution. The {\it empirical distribution function} of a series of observations of some random quantity $X_1, \ldots, X_n$ is defined as 
$$F_n(x) = \frac{|\{i~|~X_i \leq x\}|}{n}.$$
Both the distribution and the empirical distribution functions are defined over all of $\mathbb{R}$ and take values within the range $[0, 1]$.
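
To make the definition concrete, the empirical distribution function of a finite sample can be evaluated with a binary search over the sorted data. The following Python sketch is purely illustrative (the name \texttt{empirical\_cdf} is our own, not from the literature):

```python
import bisect

def empirical_cdf(samples, x):
    """Evaluate F_n(x) = |{i : X_i <= x}| / n for a finite sample."""
    xs = sorted(samples)
    # bisect_right counts how many sample points are <= x
    return bisect.bisect_right(xs, x) / len(xs)
```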

% The (one-sample) KS-test. 

The one-sample KS test compares a set of discrete data with a fixed, continuous distribution function to see if the data are drawn from this distribution. Specifically, if the empirical distribution function of a stream of length $n$ is given by $F_n$, then the KS-statistic indicating the distance between this empirical distribution and some fixed distribution $F$ is given by 
$$D_n = \sup_x |F_n(x) - F(x)|.$$ 


\begin{figure}[tb]
%\vspace{-4cm}
\centering
\includegraphics[width=.4\textwidth, angle=270]{figs/fig1.eps}
%\vspace{-2cm}
\caption{The KS-statistic is the greatest difference between distribution functions}
\label{fig:KS}
\end{figure}

The KS test simply computes the largest absolute difference between the empirical distribution function and the distribution it is being tested against. This is illustrated in Figure~\ref{fig:KS}. The distribution of this statistic under the null hypothesis can be shown to be independent of the distribution in question (i.e., the test is {\it distribution-free}), and tables of values are available for the critical region of the test. That is, for $0 < \alpha < 1$, there is some fixed value $K_{\alpha}$ such that the null hypothesis (that the data $\{X_i\}$ are drawn from the distribution $F$) is rejected at level $\alpha$ if $\sqrt{n}D_n > K_{\alpha}$.

Note that it is not necessary to compute the above supremum over infinitely many values of $x$, since the empirical distribution function changes only at the $n$ sample points. Specifically, if the points are $X_1 \leq \ldots \leq X_n$ and the distribution is defined by the c.d.f.\ $F$, then the KS-statistic can alternatively be defined as (see, e.g., Knuth~\cite{knuth})
$$D_n = \max_{1 \leq i \leq n}  \max{\left(  \frac{i}{n} - F(X_i), F(X_i) - \frac{i-1}{n}\right)}.$$
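
As a concrete (offline, non-streaming) illustration, this discrete formula can be evaluated in $O(n \log n)$ time after sorting. In the Python sketch below, the callable \texttt{F} standing in for the hypothesized c.d.f.\ is an assumption of this example:

```python
def ks_statistic(samples, F):
    """One-sample KS statistic via the discrete formula:
    D_n = max_i max(i/n - F(X_i), F(X_i) - (i-1)/n)."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fx = F(x)
        # the empirical c.d.f. jumps from (i-1)/n to i/n at X_i
        d = max(d, i / n - fx, fx - (i - 1) / n)
    return d
```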

% Two-sample KS-test.

The two-sample KS test allows one to compare two sets of samples and test whether they are drawn from the same underlying distribution. More formally, assume that we have two collections of points with cardinalities $n$ and $m$, and empirical distribution functions $F_n$ and $G_m$, respectively. Then, the two-sample KS-statistic for these points is
$$D_{n,m} = \sup_x |F_n(x) - G_m(x)|.$$
In practice, the two-sample test is often the more useful one, because it is seldom the case that we know the distribution of the values being measured a priori. For the two-sample test, the null hypothesis is rejected at level $\alpha$ when $\sqrt{\frac{nm}{n + m}}D_{n,m} > K_{\alpha}$.
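
Since both empirical distribution functions are step functions that change only at sample points, the supremum is attained at one of the $n + m$ sample points, so it suffices to evaluate the difference there. An illustrative (offline) Python sketch:

```python
import bisect

def ks_two_sample(a, b):
    """Two-sample KS statistic D_{n,m} = sup_x |F_n(x) - G_m(x)|."""
    xs, ys = sorted(a), sorted(b)
    n, m = len(xs), len(ys)
    d = 0.0
    for x in xs + ys:
        # evaluate both empirical c.d.f.s at the jump point x
        fn = bisect.bisect_right(xs, x) / n
        gm = bisect.bisect_right(ys, x) / m
        d = max(d, abs(fn - gm))
    return d
```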

\subsection{Quantile Sketches}

The algorithms in this paper will make use of streaming data structures, known as sketches, for computing the quantiles of the values in a stream. We define such a sketch as follows:

\begin{definition}
\label{def:sketch}
A {\em quantile} $\epsilon$-{\em sketch} is a data structure that, given an input $X_1, \ldots, X_n$ ($X_1 \leq X_2 \leq X_3 \leq \ldots \leq X_n$) in arbitrary order in a stream, can then be queried to return, for any $1 \leq i \leq n$, a value $X_j$ such that $j \in [i - \epsilon n, i + \epsilon n]$.
\end{definition}
These sketches are assumed to need less memory than storing all the values exactly (i.e., $o(n)$ bits), and can typically be updated very quickly. 
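
For intuition, the query guarantee of Definition~\ref{def:sketch} can be illustrated by a deliberately naive, non-streaming ``sketch'' that sorts the whole input and keeps roughly every $\lfloor \epsilon n \rfloor$th element; practical sketches (e.g., the Greenwald--Khanna summary) achieve the same guarantee online in $o(n)$ space. All names in the Python fragment below are our own:

```python
import math

def build_sketch(stream, eps):
    """Naive illustration of a quantile eps-sketch: store (rank, value)
    pairs for every k-th element of the sorted input, k = floor(eps*n).
    (Not a streaming algorithm; it sorts the entire input.)"""
    xs = sorted(stream)
    n = len(xs)
    k = max(1, math.floor(eps * n))
    pairs = [(i, xs[i - 1]) for i in range(1, n + 1, k)]
    pairs.append((n, xs[-1]))  # always keep the maximum
    return pairs, n

def query(sketch, i):
    """Return a stored value whose rank is within eps*n of i."""
    pairs, _ = sketch
    # consecutive stored ranks differ by at most k <= eps*n,
    # so the nearest stored rank satisfies the guarantee
    return min(pairs, key=lambda p: abs(p[0] - i))[1]
```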

The following observations are needed by the algorithms later in this paper.
\begin{observation}
\label{obs:sketchvalues}
It is possible to extract from a quantile $\epsilon$-sketch a subset $\{X_{i_1}, \ldots, X_{i_k}\} \subseteq \{X_1, \ldots, X_n\}$ (where $X_1 \leq X_2 \leq X_3 \leq \ldots \leq X_n$) such that $i_1 < i_2 < i_3 < \ldots < i_k$ and, for all $1 \leq j < k$, $i_{j+1} - i_j \leq 2\epsilon n$. 
\end{observation}

\begin{proof}
Since a quantile sketch is guaranteed to return only values from the original data stream, all of the values that the sketch contains can be extracted by querying it for the $i$th smallest element, for each $1 \leq i \leq n$. (See Appendix~\ref{app:extract} for details.) Let $X_{i_1} \leq \ldots \leq X_{i_k}$ be the distinct values that result from these queries. 

Now, fix any $j \in \{1, \ldots, k-1\}$ and suppose that $i_{j+1} - i_j > 2\epsilon n$. Then there is some $i'$, $i_j < i' < i_{j+1}$, whose distance from both $i_j$ and $i_{j+1}$ exceeds $\epsilon n$. Any answer to a query for the $i'$th smallest element is one of the extracted values, and every extracted value has rank at most $i_j$ or at least $i_{j+1}$; such an answer therefore has rank error greater than $\epsilon n$, contradicting the sketch guarantee. Hence, it must be that $i_{j+1} - i_j \leq 2\epsilon n$.
\end{proof}
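
The extraction step in this proof is straightforward to express in code. In the Python fragment below, \texttt{sketch\_query} is a hypothetical stand-in for querying the sketch for the $i$th smallest element:

```python
def extract_values(sketch_query, n):
    """Query every rank 1..n and collect the distinct answers
    (the extraction step in the proof of Observation 1)."""
    return sorted({sketch_query(i) for i in range(1, n + 1)})
```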


\begin{observation}
\label{obs:sketcherror}
Given some value $X_i$ returned by a quantile $\epsilon$-sketch (where $X_i$ is the $i$th smallest element in the input), it is possible to estimate $i$ to within $\epsilon n$ additive error. 
\end{observation}

\begin{proof}
Performing a binary search among the indices $1, \ldots, n$ for the value $X_i$ will give an index $j$ such that $j \in [i - \epsilon n, i + \epsilon n]$. 
A detailed description of this algorithm is given in Appendix~\ref{app:binsearch}.
\end{proof}
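
The binary search in this proof can be sketched as follows; \texttt{sketch\_query(i)}, which returns a value whose rank is within $\epsilon n$ of $i$, is again a hypothetical stand-in for the sketch interface:

```python
def estimate_rank(sketch_query, n, x):
    """Binary-search the ranks 1..n for the smallest rank whose query
    answer is >= x; under the sketch guarantee this estimates the true
    rank of x to within eps*n additive error (assuming query answers
    are nondecreasing in the queried rank)."""
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if sketch_query(mid) < x:
            lo = mid + 1
        else:
            hi = mid
    return lo
```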
